Aims
Aims of clustering
Unsupervised learning consists of statistical methods that extract meaning from data without categorising any of the data beforehand. In this sense, unsupervised learning expects the machine to decide what the categories are. Clustering data can be described as ‘the art of finding groups in data’. The main reasons for using this kind of technique are:
- to identify distinct groups within a data set
- as an extension of exploratory data analysis
- to gain insight into the data and how variables relate to one another
Given that OAs are purely administrative inventions, can we see meaningful groups emerge? Or is the data too noisy? The aim of this part of the analysis is to run, explain and measure the performance of building clusters on our data set. I have focused this part of the analysis on building age and type.
Buildings Data
At the moment, this data is only for Lewisham Council. I have run the clustering algorithm on buildings data, which has been split into categories for type and age.
Type: Flat block | Converted Flats | House (detached/semi-detached) | Terraced house
Age: Victorian/pre-WW1 | Interwar | Postwar (1945-1979) | Modern (1980-)
For each OA, the percentage of addresses that fall into each of the above categories is calculated. The clusters are based on these variables. What we are looking for here is whether or not OAs fit into neat categories of building types.
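The percentage calculation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the OA codes, the address records and the function name `type_shares` are hypothetical and are not taken from the Lewisham data.

```python
from collections import Counter, defaultdict

# Hypothetical address records: (OA code, building type).
# These rows are made up for illustration only.
addresses = [
    ("OA_A", "Flat block"),
    ("OA_A", "Flat block"),
    ("OA_A", "Terraced house"),
    ("OA_A", "Converted Flats"),
    ("OA_B", "Terraced house"),
    ("OA_B", "Terraced house"),
]

TYPES = [
    "Flat block",
    "Converted Flats",
    "House (detached/semi-detached)",
    "Terraced house",
]


def type_shares(records, categories):
    """Return {oa: {category: percentage of that OA's addresses}}."""
    counts = defaultdict(Counter)
    for oa, category in records:
        counts[oa][category] += 1
    shares = {}
    for oa, counter in counts.items():
        total = sum(counter.values())
        # A Counter returns 0 for categories that never occur in this OA.
        shares[oa] = {c: 100.0 * counter[c] / total for c in categories}
    return shares


shares = type_shares(addresses, TYPES)
```

The same function could be applied to the age categories, giving one row of percentages per OA to feed into the clustering.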
Algorithm
This analysis is carried out using k-means clustering, where we specify the number of categories and the algorithm then finds that number of categories in the data. The algorithm divides the data into K categories (a number chosen by whoever is running the algorithm) and finds centres in the data that minimise the sum of the squared distances from each data point to its nearest centre.
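To make the idea concrete, here is a minimal sketch of one common way of running k-means (Lloyd's algorithm), written in plain Python. It is not necessarily the implementation used for this analysis. The function names, the random starting rule and the six example points (percentage of flat blocks, percentage of terraced houses) are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centres, labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k randomly chosen data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centres are stable
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = mean_point(members)
    wss = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return centres, labels, wss


# Hypothetical OAs: (% flat blocks, % terraced houses).
points = [
    (90.0, 10.0), (85.0, 15.0), (80.0, 20.0),
    (10.0, 90.0), (15.0, 85.0), (20.0, 80.0),
]
centres, labels, wss = k_means(points, k=2)
```

The value `wss` is the quantity the algorithm tries to make small: the sum of squared distances from each point to the centre of its own cluster. Because the starting centres are random, different seeds can give different results on noisier data.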
Main findings
Five clusters of buildings seemed to emerge from the data, separating out:
- Victorian terraced housing, often converted into flats
- High proportions of flat blocks built in the post-war period (1945-79)
- Terraced houses built between the wars
- Modern flat blocks
- Areas with more houses, mostly interwar
The two interwar categories have the highest levels of voter registration. The modern flat blocks have the lowest. The combination of higher than average deprivation, a younger population and more renters mean that t
The clustering here is not as well defined as we thought it might be. We think there are two main reasons for this:
- A lot of development has been quite locally based, so large areas with the same type of housing are not very widespread
- OAs are quite arbitrary boundaries, drawn up for administrative purposes rather than to reflect any kind of neighbourhood, which makes this geographic unit arguably unsuitable for clustering. Looking at streets might be better.
Frequency of clusters
Labelling the clusters
The algorithm produced the clusters, which I have named according to their characteristics (see next page to explore them yourself). It should be noted that a variety of different housing can be found in each of the clusters. I have named them according to how prominent different types/ages of housing are compared with their average share.
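One way to make ‘prominent compared to the average share’ concrete is to compare each cluster's mean percentage for a variable with the mean across all OAs. The sketch below does this in plain Python. The exact naming rule used here is not spelled out above, so measuring the gap in percentage points is an assumption, as are the function name `most_prominent` and the example rows.

```python
def most_prominent(rows, labels, names):
    """For each cluster, return the variable whose cluster mean most exceeds
    the overall mean, with the gap in percentage points."""
    n_vars = len(names)
    overall = [sum(r[i] for r in rows) / len(rows) for i in range(n_vars)]
    result = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        cluster_mean = [
            sum(r[i] for r in members) / len(members) for i in range(n_vars)
        ]
        gaps = [cm - om for cm, om in zip(cluster_mean, overall)]
        best = max(range(n_vars), key=lambda i: gaps[i])
        result[cluster] = (names[best], gaps[best])
    return result


# Hypothetical OAs: (% flat blocks, % terraced houses) with made-up cluster labels.
names = ["Flat block", "Terraced house"]
rows = [(90.0, 10.0), (80.0, 20.0), (10.0, 90.0), (20.0, 80.0)]
cluster_labels = [0, 0, 1, 1]
prominent = most_prominent(rows, cluster_labels, names)
```

A cluster still contains a mix of housing; this only picks out what is over-represented in it relative to the average.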
Frequency
| Cluster | Label | Frequency |
|---|---|---|
| 1 | pw_flats | 202 |
| 2 | iw_houses | 67 |
| 3 | mdn_flats | 102 |
| 4 | iw_terraces | 131 |
| 5 | vct_conv | 382 |
Relationship with voter registration
The distribution of voter registration for each of the clusters is shown below. In these box plots, the interquartile range is represented by the box, the median is shown as the line in the middle, and the separate dots are the outliers. This shows that the ‘modern flats’ cluster has the lowest levels of voter registration and the two interwar clusters have the highest.
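The quantities a box plot draws can be computed directly, as in the short sketch below. It assumes the common convention that outliers are points more than 1.5 times the interquartile range beyond the box; the charts here may use a different rule, and plotting software can calculate quartiles in slightly different ways. The registration rates are made up for illustration.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Quartiles, interquartile range and outliers for one cluster's values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    # Assumed convention: anything beyond whisker * IQR from the box is an outlier.
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical voter registration rates (%) for the OAs in one cluster.
rates = [70, 72, 74, 75, 76, 78, 80, 40]
summary = box_plot_summary(rates)
```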
Silhouette
About this chart
This ‘silhouette chart’ represents how well the clustering model works. In very well clustered data, none of the bars would fall below the red dotted line (and certainly wouldn’t be below the 0 line!)
This chart shows that most of the data fits quite well into the five clusters, but a lot of it doesn’t. As we know, the data is very noisy, and OAs are not ideal units for clustering. The message from this chart is that quite a few areas are so mixed that you just cannot categorise them easily.
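For readers who want to see what each bar measures, the sketch below computes the standard silhouette width for every point: it compares the average distance to points in the same cluster (a) with the average distance to the nearest other cluster (b), giving (b - a) / max(a, b). Values near 1 mean a point sits firmly in its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be in the wrong one. The example points and labels are hypothetical. Treating the red dotted line as the average width across all points is an assumption, since the chart's own definition is not given here.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width of every point, using Euclidean distance."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}  # cluster label -> distances from p to that cluster's other points
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        own = dists.pop(lab, None)
        if not own or not dists:
            widths.append(0.0)  # a lone point (or a single cluster) gets width 0
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in dists.values())  # nearest other cluster
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# Hypothetical OAs: (% flat blocks, % terraced houses); the last one is evenly mixed.
points = [(90.0, 10.0), (80.0, 20.0), (10.0, 90.0), (20.0, 80.0), (50.0, 50.0)]
labels = [0, 0, 1, 1, 0]
widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)  # assumed meaning of the red dotted line
```

In this made-up example the evenly mixed area gets a width of about zero, which is the kind of area the chart shows cannot be categorised easily.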
